APPROXIMATING CONDITIONAL DISTRIBUTION FUNCTIONS
USING DIMENSION REDUCTION 

By Peter Hall and Qiwei Yao 
Australian National University and London School of Economics 

Motivated by applications to prediction and forecasting, we suggest
methods for approximating the conditional distribution function of a
random variable $Y$ given a dependent random $d$-vector $X$. The idea is
to estimate not the distribution of $Y \mid X$, but that of
$Y \mid \theta^T X$, where the unit vector $\theta$ is selected so that
the approximation is optimal under a least-squares criterion. We show
that $\theta$ may be estimated root-$n$ consistently. Furthermore,
estimation of the conditional distribution function of $Y$, given
$\theta^T X$, has the same first-order asymptotic properties that it
would enjoy if $\theta$ were known. The proposed method is illustrated
using both simulated and real-data examples, showing its effectiveness
for both independent datasets and data from time series. Numerical work
corroborates the theoretical result that $\theta$ can be estimated
particularly accurately.

1. Introduction. Estimating a conditional distribution function is an im- 
portant feature of many statistical problems, including, for example, regres- 
sion analysis [see Yin and Cook (2002) and references therein], where a sig- 
nificant problem is prediction of a response for a given value of a multivari- 
ate explanatory variable. Specific applications include those in economics 
and finance [e.g., Foresi and Paracchi (1992), Bond and Patel (2000) and 
Watanabe (2000)], in signal processing and data mining [e.g., Adali, Liu 
and Sonmez (1997)] and a wide range of problems where forecasts are to be 
made from linear or nonlinear time-series [see, e.g., Chapter 10 of Fan and 
Yao (2003), and examples in Section 4 below]. 

In most of these applications one is interested in estimating the
conditional distribution of a scalar random variable $Y$, given a random
$d$-vector $X$. Even for small values of $d \ge 2$, a conventional
nonparametric estimator can suffer poor accuracy, reflected in slow
convergence rates. We suggest a solution to this difficulty, based on
approximating the conditional distribution function of $Y$ given $X$ by
that of $Y$ given $\theta^T X$, where the unit $d$-vector $\theta$ is
selected so that the approximation is optimal under an appropriate
least-squares criterion. In particular, we avoid the problem of directly
estimating the conditional distribution function of $Y$ given $X$.

Although we are dealing with a dimension-reduction problem, the object
(i.e., the conditional distribution function) to be estimated is a
function of both $\theta^T x$ and $y$, while the index $\theta$ is a
global parameter. This rules out the possibility of direct application of
conventional dimension-reduction ideas, such as projection pursuit [e.g.,
Friedman and Stuetzle (1981), Friedman, Stuetzle and Schroeder (1984) and
Huber (1985)] and single-index modeling techniques [e.g., Powell, Stock
and Stoker (1989), Hardle, Hall and Ichimura (1993), Ichimura (1993) and
Klein and Spady (1993)], which would lead to an estimator of $\theta$
depending on $y$. Instead we define a new criterion in terms of an
accumulation of squared differences between the joint probabilities of
$(Y, X)$ and the expected conditional probabilities of $Y$ given
$\theta^T X$, over a large class of subsets; see (2.2) and (2.4) in
Section 2 below. Our search for the global parameter $\theta$ is based on
leave-one-out local linear regression estimators for conditional
distribution functions. Under very mild assumptions the resulting
estimator $\hat\theta$ is root-$n$ consistent and asymptotically normally
distributed.

Of course, our main purpose in computing $\hat\theta$ is so it can be
used in a conditional distribution estimator. The root-$n$ convergence
rate achieved by our estimator is so fast that the estimator of the
conditional distribution function of $Y$, given $\hat\theta^T X$, is
first-order equivalent to its counterpart that would be used if the true
value of $\theta$ were known.

The innovation and novelty of our methodology lie in the fact that we 
use dimension-reduction ideas to solve an important class of nonstandard 
multivariate nonparametric problems. We achieve this end by proposing new 
types of objective functions, with which are associated new theoretical and 
numerical properties. There exists an extensive literature on nonparametric 
estimation of conditional distributions. It includes work of Bhattacharya and 
Gangopadhyay (1990), Sheather and Marron (1990), Yu and Jones (1998) 
and Cai (2002) on conditional quantile regression; Rosenblatt (1969), Hyn- 
dman, Bashtannyk and Grunwald (1996), Fan, Yao and Tong (1996), Bash- 
tannyk and Hyndman (2001) and Hyndman and Yao (2002) on conditional 
density estimation; and Hall, Wolff and Yao (1999) on estimation of condi- 
tional distribution functions. Dimension reduction has been discussed exten- 
sively in the context of regression and density approximation; in addition to 
the references cited earlier we mention the work of Friedman (1987), Jones 
and Sibson (1987), Li (1991) and Posse (1995). 
This article is organized as follows. In Section 2 we introduce our
method for estimating $\theta$. Asymptotic properties of the estimators
$\hat\theta$ and $\tilde F(\cdot \mid \hat\theta^T X)$ are presented in
Section 3. Numerical examples involving both simulated models and a
real-data application are given in Section 4. Technical arguments are
outlined in Section 5.

2. Methodology. 

2.1. Motivation. Assume we observe data $(X_i, Y_i)$, for
$1 \le i \le n$, from the distribution of $(X, Y)$. Here $X$ is a
$d$-vector and $Y$ is a scalar. Let $\Theta$ denote the set of
$d$-variate unit vectors $\theta$ with first nonzero component positive,
write $f$ for the density of $X$ and let
$F_{Y \mid \theta^T X}(\cdot \mid z)$ represent the distribution of $Y$
conditional on $\theta^T X = z$. Given subsets $A$ and $B$ of
$d$-dimensional space and of the real line, respectively, define

$\pi_\theta(A, B) = \int_A F_{Y \mid \theta^T X}(B \mid \theta^T x)\, f(x)\, dx, \qquad \pi(A, B) = P(X \in A,\, Y \in B).$

If, for some $\theta$ and all $x$,
$F_{Y \mid \theta^T X}(\cdot \mid \theta^T x)$ is identical to the
distribution of $Y$ given that $X = x$, then for this $\theta$,
$\pi_\theta(A, B) = \pi(A, B)$ for all $A$, $B$. We suggest taking the
sets $A$ to be $d$-variate spheres with differing centers and radii, and
the sets $B$ to be semi-infinite intervals.

We can estimate $F_{Y \mid \theta^T X}$ using nonparametric methods,
permitting us to estimate $\pi_\theta(A, B)$. Of course, we can estimate
$\pi(A, B)$ as the proportion of pairs $(X_i, Y_i)$ that lie in
$A \times B$. Hence, for each triple $(\theta, A, B)$ we can estimate
$\pi_\theta(A, B)$ and $\pi(A, B)$ under minimal conditions. (We shall
denote estimators of $\pi_\theta$ and $\pi$ by $\hat\pi_\theta$ and
$\hat\pi$, respectively.) Therefore we can check (or, more formally,
test) the hypothesis that $F_{Y \mid \theta^T X}(\cdot \mid \theta^T x)$
is identical to the distribution of $Y$ conditional on $X = x$, for all
$x$, by examining the average value of
$\{\hat\pi_\theta(A, B) - \hat\pi(A, B)\}^2$ over a range of sets $A$ and
$B$.
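The empirical estimator of $\pi(A, B)$ described above is just a counting exercise. The following is a minimal sketch, assuming $A$ is a closed ball and $B = (-\infty, y]$; the function names `in_ball` and `pi_hat` and the toy data are illustrative and do not come from the paper.

```python
import math

def in_ball(x, center, radius):
    """True if the point x lies in the closed ball with the given center and radius."""
    return math.dist(x, center) <= radius

def pi_hat(xs, ys, center, radius, y):
    """Proportion of pairs (X_i, Y_i) with X_i in the ball and Y_i <= y."""
    hits = sum(1 for x, yi in zip(xs, ys)
               if in_ball(x, center, radius) and yi <= y)
    return hits / len(xs)

# Toy data (hypothetical): four bivariate design points with responses.
xs = [(0.0, 0.0), (0.5, 0.5), (2.0, 0.0), (0.0, -0.9)]
ys = [0.1, 1.2, -0.3, 0.4]
estimate = pi_hat(xs, ys, center=(0.0, 0.0), radius=1.0, y=0.5)
print(estimate)
```

Three of the four points fall in the unit ball, and two of those have responses not exceeding 0.5, so the estimate is one half.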

Although exact equality of $\pi$ and $\pi_\theta$ is unlikely in
practice, the difference-based criterion noted above can be used to
empirically select $\theta$ such that, in a global sense, the
distribution of $Y$ given $\theta^T X = \theta^T x$ is a good
approximation to the distribution of $Y$ given that $X = x$. Indeed, the
argument in the previous paragraph suggests that methodology of this type
could be based on the difference measure,

(2.1) $S_1(\theta) = \iint \{\pi_\theta(A_\alpha, B_\beta) - \pi(A_\alpha, B_\beta)\}^2\, w(\alpha, \beta)\, d\alpha\, d\beta,$

where $w$ is a weight function and the integral is taken over a
parameterization $(\alpha, \beta)$ of $(A, B)$.

The spheres $A = A_\alpha$ should be such that the density
$f_{\theta^T X}$ of $\theta^T X$ is bounded away from zero at all points
$\theta^T x$ with $\theta \in \Theta$ and $x \in A_\alpha$. Otherwise,
design sparseness problems can arise when nonparametrically estimating
$F_{Y \mid \theta^T X}$. Considerations of this type suggest taking the
$A_\alpha$'s to be $d$-variate spheres whose centers confine them to lie
inside a larger, bounded region where $f$ is bounded away from zero. Such
restrictions are unnecessary when considering the intervals $B$, except
that there is little point in giving emphasis to sets for which
$P(Y \in B)$ is low.

For these reasons, when permitting $B_\beta$ to be the interval
$(-\infty, \beta]$ it is appropriate to take $w(\alpha, \beta)$ in (2.1)
to be proportional to the density of $Y$ at $\beta$, and to not depend on
$\alpha$. We shall achieve this end empirically, by replacing the double
integral in (2.1) by a sum of integrals,

(2.2) $S(\theta) = \sum_{j=1}^{n} \int \{\hat\pi_\theta(A_\alpha, B_{Y_j}) - \hat\pi(A_\alpha, B_{Y_j})\}^2\, d\alpha,$

where $B_\beta$ denotes $(-\infty, \beta]$ and the integral is taken over
an appropriate set of sphere centers and radii. In the future we shall
use the notation $y$ instead of $\beta$.

When constructing our estimator of $F_{Y \mid \theta^T X}$, which is
central to our computation of $\hat\pi_\theta$, we shall use a
"leave-one-out" technique, or more accurately, "leave-two-out." Our
method will employ the empirical distribution of $\theta^T X_i$ as a
surrogate for the true distribution of $\theta^T X$, and so, when
$\theta^T X_i$ appears in the argument of $\hat F_{Y \mid \theta^T X}$,
we shall omit $X_i$ from the latter. The second omission occurs because,
as formula (2.2) suggests, we shall validate on $Y_j$ when constructing
our least-squares criterion. Therefore we shall omit both the $i$th and
the $j$th pairs when calculating $\hat F_{Y \mid \theta^T X}$; see (2.3)
below.

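2.2. Estimator of $\theta$. With these principles in mind, let $h$ be a
bandwidth and let $K$ be a kernel function, and define

(2.3)
$s_{k;-i,-j}(\theta) = \frac{1}{(n-2)h} \sum_{i_1 \ne i,j} K\Big\{\frac{\theta^T (X_i - X_{i_1})}{h}\Big\} \Big\{\frac{\theta^T (X_i - X_{i_1})}{h}\Big\}^k, \qquad k = 1, 2,$

$w_{i i_1;-i,-j}(\theta) = K\Big\{\frac{\theta^T (X_i - X_{i_1})}{h}\Big\} \times \Big\{ s_{2;-i,-j}(\theta) - \frac{\theta^T (X_i - X_{i_1})}{h}\, s_{1;-i,-j}(\theta) \Big\},$

$\hat F_{-i,-j}(y \mid \theta^T X_i) = \Big[ \sum_{i_1 \ne i,j} w_{i i_1;-i,-j}(\theta)\, I(Y_{i_1} \le y) \Big] \times \Big\{ \sum_{i_1 \ne i,j} w_{i i_1;-i,-j}(\theta) \Big\}^{-1}.$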

Write simply $F(y \mid z)$ for $P(Y \le y \mid \theta^T X = z)$, and let
$A$ be a subset of $d$-variate space. In this notation,
$\hat F_{-i,-j}(y \mid \theta^T X_i)$ is a local linear estimator of
$F(y \mid \theta^T X_i)$, and
$(n-1)^{-1} \sum_{i \ne j} I(X_i \in A)\, \hat F_{-i,-j}(y \mid \theta^T X_i)$
is an estimator of $\pi_\theta(A, B)$ when $B = (-\infty, y]$.
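A leave-out local linear estimator of a conditional distribution function can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard local linear weights $K(u_k)\{s_2 - u_k s_1\}$ with $u_k = (z - z_k)/h$, an Epanechnikov kernel, and a hypothetical `skip` argument listing the indices to omit (two indices for "leave-two-out"). Normalizing constants such as $1/\{(n-2)h\}$ are dropped because they cancel in the ratio.

```python
def epanechnikov(u):
    """Epanechnikov kernel, supported on (-1, 1)."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def local_linear_cdf(y, z, zs, ys, h, skip=()):
    """Local linear estimate of P(Y <= y | theta^T X = z).

    zs holds the projected design points theta^T X_k, ys the responses,
    and skip the indices left out of the fit.
    """
    idx = [k for k in range(len(zs)) if k not in skip]
    u = {k: (z - zs[k]) / h for k in idx}
    kern = {k: epanechnikov(u[k]) for k in idx}
    s1 = sum(kern[k] * u[k] for k in idx)
    s2 = sum(kern[k] * u[k] ** 2 for k in idx)
    w = {k: kern[k] * (s2 - u[k] * s1) for k in idx}
    denom = sum(w.values())
    if denom == 0.0:
        return float("nan")  # too few points in the kernel window
    return sum(w[k] for k in idx if ys[k] <= y) / denom

# Symmetric toy design around z = 0 (hypothetical numbers).
zs = [-0.5, -0.25, 0.25, 0.5]
ys = [0.0, 1.0, 0.0, 1.0]
full = local_linear_cdf(0.5, 0.0, zs, ys, h=1.0)
left_two_out = local_linear_cdf(0.5, 0.0, zs, ys, h=1.0, skip=(1, 3))
print(full, left_two_out)
```

With the symmetric design the first-moment sum vanishes and the estimate reduces to a kernel-weighted proportion; omitting the two points with response 1 leaves only responses at or below 0.5, so the leave-two-out estimate is 1.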

As a rule we take $\alpha$ to be a $(d+1)$-vector, its first $d$
components denoting the center of $A_\alpha$ and the last component, $r$
say, its radius. We suppose that $r \in J = [r_1, r_2]$, where
$0 \le r_1 \le r_2 < \infty$ and not both $r_1$ and $r_2$ vanish. In the
case $r_1 = r_2$ the spheres all have the same radius, and here $\alpha$
should be interpreted as a $d$-vector, with integrals over $r$, in our
discussion below, ignored. With this interpretation, our account of
methodology applies to the case where $r$ takes values in the continuum
as well as to that where $r$ is fixed. Clearly the latter instance can be
generalized to the case of a finite number of discrete radii.

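One approach is to average over all spheres $A_\alpha$ that lie entirely
within a given, fixed set $\mathcal{R}$. With this in mind, let
$Q = \{\alpha : A_\alpha \subseteq \mathcal{R}\}$ be the set of sphere
centers (and radii, if $J$ is not degenerate). Write
$\hat F_{-j}(A, y)$ for the proportion of the $n - 1$ values of
$(X_i, Y_i)$, for $i \ne j$, that satisfy
$(X_i, Y_i) \in A \times (-\infty, y]$. Put

(2.4) $S(\theta) = \sum_{j=1}^{n} \int_Q \Big[ \frac{1}{n-1} \sum_{i \ne j} I(X_i \in A_\alpha)\, \hat F_{-i,-j}(Y_j \mid \theta^T X_i) - \hat F_{-j}(A_\alpha, Y_j) \Big]^2 d\alpha.$
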
The latter represents a particular form of $S(\theta)$ in (2.2). In
practice, the integration over $\alpha$ in (2.4) is typically replaced by
a sum over a class of selected balls; see (4.1) below. In fact the
asymptotic theory in Section 3 still holds with this discrete version of
$S(\theta)$ if the same replacement is applied wherever appropriate,
including in condition (3.3).
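To make the discrete version of the criterion concrete, here is a rough sketch of how it might be evaluated for one candidate direction. It is an illustration under assumptions, not the authors' code: the criterion is read as the average, over a finite list of balls, of the summed squared gaps between a leave-two-out model-based estimate and the leave-one-out empirical proportion; the kernel is Epanechnikov; windows with no usable neighbours contribute zero; and all names (`criterion`, `ll_cdf`, `balls`) are hypothetical.

```python
import math
import random

def kernel(u):
    """Epanechnikov kernel."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def ll_cdf(y, z, zs, ys, h, skip):
    """Local linear estimate of P(Y <= y | theta^T X = z), omitting indices in skip."""
    idx = [k for k in range(len(zs)) if k not in skip]
    u = [(z - zs[k]) / h for k in idx]
    kv = [kernel(v) for v in u]
    s1 = sum(a * b for a, b in zip(kv, u))
    s2 = sum(a * b * b for a, b in zip(kv, u))
    w = [a * (s2 - b * s1) for a, b in zip(kv, u)]
    denom = sum(w)
    if denom <= 0.0:
        return 0.0  # assumption: no usable neighbours, contribute nothing
    return sum(wk for wk, k in zip(w, idx) if ys[k] <= y) / denom

def criterion(theta, xs, ys, balls, h):
    """Average over balls of the summed squared gaps, validating on each Y_j."""
    n = len(xs)
    zs = [sum(t * c for t, c in zip(theta, x)) for x in xs]
    total = 0.0
    for center, radius in balls:
        inside = [math.dist(x, center) <= radius for x in xs]
        for j in range(n):
            others = [i for i in range(n) if i != j and inside[i]]
            model = sum(ll_cdf(ys[j], zs[i], zs, ys, h, (i, j))
                        for i in others) / (n - 1)
            empirical = sum(1 for i in others if ys[i] <= ys[j]) / (n - 1)
            total += (model - empirical) ** 2
    return total / len(balls)

# Small simulated data set (hypothetical): Y depends on the first coordinate only.
rng = random.Random(0)
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
ys = [x[0] + 0.3 * rng.gauss(0, 1) for x in xs]
balls = [((0.0, 0.0), 1.0), ((0.5, 0.0), 1.0)]
value = criterion((1.0, 0.0), xs, ys, balls, h=0.8)
print(value)
```

The cost grows roughly like the number of balls times $n^3$ in this naive form, which is one reason a modest number of balls and a derivative-free optimizer are natural choices in practice.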

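We choose $\hat\theta$ to minimize $S(\theta)$ over $\theta \in \Theta$.
Thus, $\hat\theta$ may be viewed as an estimator of $\theta_0$, the
minimizer (over $\theta \in \Theta$) of

(2.5) $S_0(\theta) = \int_Q d\alpha \int \{F(A_\alpha, y) - G_\theta(A_\alpha, y)\}^2\, f_Y(y)\, dy,$

where $F(A, y) = P\{(X, Y) \in A \times (-\infty, y]\}$, $f_Y$ denotes
the density of $Y$ and

(2.6) $G_\theta(A, y) = \int_A F(y \mid \theta^T x)\, f(x)\, dx.$
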

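A low-dimensional approximation to $F_{Y \mid X}(y \mid X = x)$ is
therefore $\hat F_{\hat\theta}(y \mid \hat\theta^T x)$, where
$\hat F_\theta(y \mid z)$ is an estimator of
$P(Y \le y \mid \theta^T X = z)$. Denoting by $\tilde F$ a local linear
version of $\hat F$, we define

(2.7) $\tilde F_\theta(y \mid \theta^T x) = \Big\{ \sum_{i=1}^{n} w_i(x, \theta)\, I(Y_i \le y) \Big\} \Big\{ \sum_{i=1}^{n} w_i(x, \theta) \Big\}^{-1},$

where

$w_i(x, \theta) = K\Big\{\frac{\theta^T (x - X_i)}{H}\Big\} \Big\{ s_2(x, \theta) - \frac{\theta^T (x - X_i)}{H}\, s_1(x, \theta) \Big\}, \qquad s_k(x, \theta) = \frac{1}{nH} \sum_{i=1}^{n} K\Big\{\frac{\theta^T (x - X_i)}{H}\Big\} \Big\{\frac{\theta^T (x - X_i)}{H}\Big\}^k.$

Our empirical, low-dimensional approximation to
$F_{Y \mid X}(y \mid X = x)$ is taken to be
$\tilde F_{\hat\theta}(y \mid \hat\theta^T x)$, and is of course an
estimator of $P(Y \le y \mid \theta_0^T X = \theta_0^T x)$.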

2.3. Empirical bandwidth choice: a rule of thumb. Two bandwidths need to
be chosen: $h$ for estimating $\theta$, and $H$ for estimating
$\tilde F_\theta(y \mid \theta^T x)$ with $\theta$ given. In such
nonstandard problems, conventional bandwidth selection methods for
nonparametric regression are either tedious to apply (as in the case of
plug-in methods), or do not facilitate obvious analogies (e.g.,
cross-validation and its variants). Note that with $\theta$ given,
estimation of $\tilde F_\theta(y \mid \theta^T x)$ has been investigated
by, among others, Hall, Wolff and Yao (1999). They proposed a bootstrap
method based on an approximating parametric model to determine the
bandwidth, which we will adopt for determining $H$. Furthermore, we
outline a similar empirical procedure below for determining $h$.

First we fit the linear model

(2.8) $Y_i = \beta_0 + \beta^T X_i + \varepsilon_i.$

Let $\hat\beta_0$ and $\hat\beta$ be the estimators derived by, for
example, least squares, and let $\hat\varepsilon_1, \ldots,
\hat\varepsilon_n$ denote the centered residuals. We shall compute a
bootstrap sample $\{Y_1^*, \ldots, Y_n^*\}$ from the model

(2.9) $Y_i^* = \hat\beta_0 + \hat\beta^T X_i + \varepsilon_i^*,$

where $\{\varepsilon_i^*\}$ denotes a conventional bootstrap resample
drawn by sampling with replacement from $\{\hat\varepsilon_i\}$. Then the
conditional distribution of $Y_i^*$, given $X_i$, depends on $X_i$
through $\hat\beta^T X_i$ alone. Let $\hat\beta^* = \hat\beta^*(h)$ be
the estimator obtained in the same manner as $\hat\theta$ but with the
data $(X_i, Y_i)$ replaced by their resampled counterparts
$(X_i, Y_i^*)$; see Section 2.2. We choose $h$ to minimize

(2.10) $M_1(h) = E\big[ \|\hat\beta^* - \hat\beta\|^2 \,\big|\, \{(X_i, Y_i)\} \big].$
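The resampling step in this procedure is a standard residual bootstrap. A minimal sketch follows, assuming the fitted values and raw residuals from the linear fit are already available; the function name `bootstrap_responses` and the toy numbers are hypothetical, and the subsequent re-estimation and averaging over resamples needed to approximate the conditional expectation in the bandwidth criterion are not shown.

```python
import random

def bootstrap_responses(fitted, residuals, rng):
    """Return one bootstrap sample Y* = fitted value + resampled centered residual."""
    mean = sum(residuals) / len(residuals)
    centered = [e - mean for e in residuals]
    # Sample residuals with replacement, one draw per observation.
    return [f + rng.choice(centered) for f in fitted]

# Toy fitted values and residuals (hypothetical numbers).
rng = random.Random(1)
fitted = [0.0, 1.0, 2.0, 3.0]
residuals = [0.5, -0.5, 1.5, 0.5]
ystar = bootstrap_responses(fitted, residuals, rng)
print(ystar)
```

For each candidate bandwidth one would repeat this draw many times, recompute the direction estimate on each resample, and average the squared distance to the least-squares direction.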

It is important that the two bandwidths h and H should be different. As 
we shall show in Section 3, optimal performance is achieved if h is of smaller 
order than H. The simulation results reported in Section 4 indicate that 
the bandwidths selected by the bootstrap methods discussed above produce 
estimators with good performance. 
3. Theory. For simplicity we discuss only the case where the data
$(X_i, Y_i)$ are independent. Analogues of our main results, Theorems 3.1
and 3.2, may be derived for dependent data, in particular for sequences
of pairs $(X_i, Y_i)$ that satisfy sufficiently strong mixing conditions.
The case of dependence will be explored numerically in Section 4.

Let us first define the vector of derivatives, $\dot a$, of a function
$a$ of $\theta \in \Theta$. Let $\omega_1, \ldots, \omega_{d-1}$ be
orthonormal vectors all perpendicular to $\theta$, put
$\omega_{i\delta} = (1 - \delta^2)^{1/2}\theta + \delta\omega_i$ for a
scalar $\delta$ and set

$\dot a_i = \lim_{\delta \to 0} \delta^{-1} \{a(\omega_{i\delta}) - a(\theta)\},$

assuming the limit exists and is finite. Then

$\dot a = \sum_{i=1}^{d-1} \dot a_i\, \omega_i,$

a vector in the plane perpendicular to $\theta$. Similarly we may define
the matrix, $\ddot a$, of second derivatives of $a$.
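As a small worked case (not taken from the paper), consider the linear function $a(\theta) = c^T\theta$ for a fixed $d$-vector $c$, and read the vector of derivatives as the combination $\sum_i \dot a_i \omega_i$ of the directional limits, which is an assumption consistent with the statement that it lies in the plane perpendicular to $\theta$. Since $(1-\delta^2)^{1/2} - 1 = O(\delta^2)$ and $\{\theta, \omega_1, \ldots, \omega_{d-1}\}$ is an orthonormal basis:

```latex
\begin{align*}
\dot a_i &= \lim_{\delta\to 0}\delta^{-1}
  \bigl\{(1-\delta^2)^{1/2}\,c^T\theta + \delta\,c^T\omega_i - c^T\theta\bigr\}
  = c^T\omega_i,\\
\dot a &= \sum_{i=1}^{d-1}(c^T\omega_i)\,\omega_i
  = (I-\theta\theta^T)\,c .
\end{align*}
```

So in this case the derivative is the ordinary gradient $c$ projected onto the tangent plane of the unit sphere at $\theta$, and it does not depend on the particular choice of the $\omega_i$.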

Let $(X, Y)$ have the distribution of a generic pair $(X_i, Y_i)$. We
shall assume that

(3.1) the density of $(X, Y)$ has four bounded derivatives, and all
moments of $Y$ are finite.

The bandwidth $h$ will be permitted to vary within a range, effectively
from $n^{-1/3}$ to $n^{-1/4}$; see (3.4) below. If we were confining
attention to the lower end of this range, then we could reduce the
smoothness assumption in (3.1) from four bounded derivatives to three
derivatives plus a Holder continuity condition. In this sense, the
smoothness required by (3.1) is excessive.

Recall that if sphere radii vary in the continuum, then $Q$ denotes a set
of sphere centers and radii, while if there is a single, fixed radius,
then $Q$ is a set just of sphere centers. In either case, all spheres in
$Q$ are completely contained within $\mathcal{R}$; see the definition of
$Q$ in Section 2.2. We shall suppose that

(3.2) $\mathcal{R}$ is an open, bounded set; the density of $X$ is
bounded away from zero on $\mathcal{R}$; and the content of $Q$ is
nonzero.

In particular, this and (3.1) ensure that the density of the distribution
of $\theta^T X$ is bounded away from zero on the set of points
$\theta^T x$ with $x \in A \subseteq \mathcal{R}$. Assumption (3.2) may
therefore be viewed as the analogue of the condition, imposed in more
standard problems of nonparametric regression, that the design density is
bounded above zero.

Conditions (3.1) and (3.2) imply a range of smoothness properties of the
marginal density $f_{\theta^T X}$ and the conditional distribution
$F(y \mid z) = P(Y \le y \mid \theta^T X = z)$. For example, the $k_1$th
derivative with respect to $\theta$, of the $k_2$th derivative with
respect to $z$, of either $f_{\theta^T X}(z)$ or $F(y \mid z)$, is well
defined and bounded in $k_1 + k_2 \le 4$, $y$, $\theta \in \Theta$ and
$z = \theta^T x$ for $x \in \mathcal{R}$.

Recall the definition of $G_\theta(A, y)$ in (2.6), and let
$\dot G_\theta(A, y)$ and $\ddot G_\theta(A, y)$ denote, respectively,
the vector of first derivatives and the matrix of second derivatives of
$G_\theta(A, y)$ with respect to $\theta$, with $(A, y)$ held fixed. Note
that $\theta_0 = \arg\min_\theta S_0(\theta)$, where $S_0$ is defined in
(2.5).

Put

$M(\theta) = \int_Q d\alpha \int \big[ \dot G_\theta(A_\alpha, y)\, \dot G_\theta(A_\alpha, y)^T - \{F(A_\alpha, y) - G_\theta(A_\alpha, y)\}\, \ddot G_\theta(A_\alpha, y) \big]\, f_Y(y)\, dy,$

a $d \times d$ matrix. By assuming that

(3.3) $\theta_0$ gives a unique global minimum of $S_0(\theta)$, and
$\omega^T M(\theta_0)\, \omega > 0$ for each nonvanishing vector
$\omega \perp \theta_0$,

we require, equivalently, that $S_0(\theta) \to S_0(\theta_0)$ at exactly
the rate $\|\theta - \theta_0\|^2$ as $\theta \to \theta_0$. Of the
kernel $K$ and bandwidth $h$ we shall assume that

(3.4) $K$ is nonnegative, symmetric and compactly supported, and has a
bounded derivative; and, for some $\varepsilon > 0$, $h = h(n)$ satisfies
$h = O(n^{-\varepsilon - (1/4)})$ and $n^{\varepsilon - (1/3)} = O(h)$ as
$n \to \infty$.

The most important aspect of this assumption is that it implies $h$
should lie between $n^{-1/3}$ and $n^{-1/4}$, and so should be an order
of magnitude smaller than a conventional bandwidth for estimating a
univariate function by nonparametric regression. A conventional bandwidth
would be of size $n^{-1/5}$.
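Let $\phi_{\theta^T X \mid A}$ denote the density of $\theta^T X$
conditional on $X \in A$, and define

(3.5) $\psi(A, x_1, y_1, y, \theta) = \{I(y_1 \le y) - F(y \mid \theta^T x_1)\} \times \Big\{ I(x_1 \in A) - \frac{\phi_{\theta^T X \mid A}(\theta^T x_1)\, P(X \in A)}{f_{\theta^T X}(\theta^T x_1)} \Big\}.$

[The ratio in this definition is guaranteed well defined, since
$P(X \in A)\, \phi_{\theta^T X \mid A} \le f_{\theta^T X}$.] Let $V$
denote the Gaussian $d$-vector with zero mean and covariance matrix equal
to that of

$W = \int_Q d\alpha \int \psi(A_\alpha, X, Y, y, \theta_0)\, \dot G_{\theta_0}(A_\alpha, y)\, f_Y(y)\, dy + \int_Q \{F(A_\alpha, Y) - G_{\theta_0}(A_\alpha, Y)\}\, \dot G_{\theta_0}(A_\alpha, Y)\, d\alpha.$

Let $\|\cdot\|$ denote the Euclidean metric in $d$-variate space, and
recall that $\hat\theta$ is defined to be the global minimizer of
$S(\theta)$ in (2.4).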
Theorem 3.1. Assume conditions (3.1)-(3.4). Then
$\hat\theta \to \theta_0$ with probability 1, and
$n^{1/2} M(\theta_0)(\hat\theta - \theta_0)$ converges in distribution to
$V$ as $n \to \infty$.

To appreciate the implications of this result, let $\hat\theta^\perp$
denote the projection of $\hat\theta$ into the plane $\Pi^\perp$ that is
perpendicular to $\theta_0$. (Equivalently, $\hat\theta^\perp$ is the
projection of $\hat\theta - \theta_0$ into $\Pi^\perp$.) The first part
of Theorem 3.1 implies that $\|\hat\theta - \theta_0\| \to 0$ with
probability 1, from which it follows (since $\hat\theta$ and $\theta_0$
are both unit vectors) that

(3.6) $\hat\theta - \theta_0 = \hat\theta^\perp + o(\|\hat\theta - \theta_0\|)$

with probability 1. That is, in first-order asymptotic terms,
$\hat\theta - \theta_0$ is completely describable through the projection
of this vector into the plane perpendicular to $\theta_0$.
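The step from unit length to (3.6) can be spelled out with a short calculation. The following is a sketch of one way to see it, writing $\hat\theta$ for the estimator, $\theta_0$ for the target and $\hat\theta^{\perp}$ for the component of $\hat\theta$ perpendicular to $\theta_0$ (this notation is assumed here), and using only $\|\hat\theta\| = \|\theta_0\| = 1$:

```latex
\begin{align*}
\hat\theta &= (\theta_0^T\hat\theta)\,\theta_0 + \hat\theta^{\perp},
\qquad
\|\hat\theta-\theta_0\|^2 = 2 - 2\,\theta_0^T\hat\theta,\\
\hat\theta-\theta_0 &= \hat\theta^{\perp} - (1-\theta_0^T\hat\theta)\,\theta_0
 = \hat\theta^{\perp} - \tfrac12\,\|\hat\theta-\theta_0\|^2\,\theta_0 .
\end{align*}
```

The component along $\theta_0$ is therefore of second order in $\|\hat\theta - \theta_0\|$, which is negligible relative to $\|\hat\theta - \theta_0\|$ once the latter tends to zero.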

Note that, by definition of differentiation with respect to $\theta$, the
vector $\dot G_\theta$ is perpendicular to $\theta$. It therefore follows
from the definition of $V$ that, with probability 1, $V$ lies completely
in $\Pi^\perp$. Observe too that, in view of (3.3), there is a
generalized inverse of $M_0 = M(\theta_0)$ (call it $M_0^-$) that is well
defined in $\Pi^\perp$. It has the property that

$M_0 M_0^- v = M_0^- M_0 v = v$ for all $v \in \Pi^\perp$.

These results, Theorem 3.1 and (3.6) imply that
$n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to $M_0^- V$.

Of course, our main purpose in computing $\hat\theta$ is so it can be
used in a conditional distribution estimator, such as $\tilde F_\theta$
introduced in (2.7). Theorem 3.2 below shows that the root-$n$
consistency achieved by the estimator $\hat\theta$ makes that quantity so
accurate that, from the viewpoint of first-order performance, the
estimator $\tilde F_{\hat\theta}(y \mid \hat\theta^T x)$ is equivalent to
its counterpart which would be employed if the value of $\theta_0$ were
known. This result has analogues for general choice of the bandwidth used
for $\tilde F_\theta$; they describe a range of circumstances where the
leading bias and variance terms do not include the effect of estimating
$\theta$. However, for the sake of simplicity and brevity we shall treat
only the optimal size of bandwidth.

The latter size is $n^{-1/5}$, and when that is employed,
$\tilde F_{\theta_0}(y \mid \theta_0^T x)$ converges to its limit at rate
$n^{-2/5}$. We shall show in Theorem 3.2 that the difference between
$\tilde F_{\hat\theta}(y \mid \hat\theta^T x)$ and
$\tilde F_{\theta_0}(y \mid \theta_0^T x)$ is then of strictly smaller
order than $n^{-2/5}$.

These considerations motivate the following assumption:

(3.7) the bandwidth $H$ used to construct $\tilde F_\theta$ has the
property that $n^{1/5} H$ is bounded away from zero and infinity as
$n \to \infty$; and the kernel is nonnegative, symmetric, compactly
supported and has a bounded derivative.

Note that $H$ and $h$ are of different orders, the former being of size
$n^{-1/5}$ and the latter of smaller order. We shall reduce the
stringency of (3.1), assuming instead that

(3.8) the density of $(X, Y)$ has two continuous derivatives, and all
moments of $Y$ are finite.

As the following theorem shows, we do not need the full force of the
result that $\hat\theta - \theta_0 = O_p(n^{-1/2})$; the convergence rate
$o_p(n^{-2/5})$ suffices.

Theorem 3.2. Assume (3.2), (3.7), (3.8), that $x \in \mathcal{R}$, and
that $\hat\theta - \theta_0 = o_p(n^{-2/5})$ as $n \to \infty$. Then for
each $y$,

$\tilde F_{\hat\theta}(y \mid \hat\theta^T x) = \tilde F_{\theta_0}(y \mid \theta_0^T x) + o_p(n^{-2/5}).$

It follows from the asymptotic normality of local linear regression
estimation [see, e.g., Theorem 1 of Fan, Heckman and Wand (1995), and
Remark 4 of Hall, Wolff and Yao (1999)] that the estimator
$\tilde F_{\theta_0}(y \mid \theta_0^T x)$ is asymptotically normally
distributed with convergence rate $n^{-2/5}$. By Theorem 3.2 above,
$\tilde F_{\hat\theta}(y \mid \hat\theta^T x)$ and
$\tilde F_{\theta_0}(y \mid \theta_0^T x)$ have the same asymptotic
distribution.

4. Numerical properties. We approximate the integral in (2.4) by a
series,

(4.1) $S(\theta) = \frac{1}{B} \sum_{i=1}^{B} S(\theta, A_i),$

where the $A_i$'s are spheres of radius $r$ contained within
$\mathcal{R}$. In practice one would select a value of $B$ that permitted
the calculations to be completed within a reasonable time, and compute
estimates for that value as well as for substantially smaller ones, say
half and three-quarters of the initial $B$. Provided there was little
variation in the results, the larger $B$ would be appropriate. The
results reported in this section show that choice of $B$ has little
effect on final results.

In the numerical examples below we searched for $\hat\theta$ (with $h$
fixed) using the downhill simplex method; see Section 10.4 of Press,
Teukolsky, Vetterling and Flannery (1992). Using the Epanechnikov kernel,
the bandwidths were sought among values $h_i = 0.1 \times 1.2^i$ for
$i = 1, \ldots, 15$, based on the bootstrap methods outlined in Section
2.3. We used sample sizes $n = 200$ and $400$. Each setting was
replicated 50 times. Throughout Examples 1 and 2 below we took $X_{ij}$
and $\varepsilon_i$ to be totally independent $N(0, 1)$ random variables.
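The candidate bandwidths form a geometric grid, which is easy to tabulate. A minimal sketch, assuming the grid is $0.1 \times 1.2^i$ for $i = 1, \ldots, 15$ as described; the function name `bandwidth_grid` is illustrative.

```python
def bandwidth_grid(base=0.1, ratio=1.2, count=15):
    """Geometric grid of candidate bandwidths base * ratio**i, i = 1..count."""
    return [base * ratio ** i for i in range(1, count + 1)]

grid = bandwidth_grid()
print([round(h, 3) for h in grid])
```

The grid runs from 0.12 to roughly 1.54, so it spans about one order of magnitude with finer resolution at the small end.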

Example 1. Here we consider the model

$Y_i = \theta_1 X_{i1} + \theta_2 X_{i2} + \theta_3 X_{i3} + \theta_4 X_{i4} + \varepsilon_i,$

where $\theta^T = (\theta_1, \ldots, \theta_4) = (1, 2, 0, 3)/\sqrt{14}$.
Thus, the conditional distribution of $Y$, given
$X = (X_1, \ldots, X_4)^T$, is $N(\theta^T X, 1)$. We let the radius be
$r = 1$, and sphere centers be points $(x_1, x_2, x_3, x_4)$, where each
$x_j$ ranged over either five or seven grid points between $-1.5$ and
$1.5$, with spacing 0.75 or 0.5, respectively, resulting in $B = 625$ or
$B = 2401$.
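The two sets of sphere centers are Cartesian products of one-dimensional grids, which accounts for the counts $5^4 = 625$ and $7^4 = 2401$. A minimal sketch follows; the helper name `sphere_centers` is hypothetical.

```python
import itertools
import math

def sphere_centers(points_per_axis, low=-1.5, high=1.5, dim=4):
    """All points of a regular grid on [low, high]^dim, as tuples."""
    step = (high - low) / (points_per_axis - 1)
    axis = [low + k * step for k in range(points_per_axis)]
    return list(itertools.product(axis, repeat=dim))

coarse = sphere_centers(5)   # spacing 0.75
fine = sphere_centers(7)     # spacing 0.5
theta = [c / math.sqrt(14) for c in (1, 2, 0, 3)]
print(len(coarse), len(fine), sum(t * t for t in theta))
```

The last printed value confirms that dividing $(1, 2, 0, 3)$ by $\sqrt{14}$ gives a unit vector, up to floating-point rounding.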

Figure 1(a)-(c) presents boxplots of the inner product
$\hat\theta^T \theta$, where, respectively, the bandwidth $h$ was
computed by minimizing (2.10), or taken equal to the latter value
multiplied by 1.5 or 0.7. Since both $\hat\theta$ and $\theta$ are unit
vectors, $\hat\theta^T \theta = 1$ if and only if $\hat\theta = \theta$.
We see from Figure 1(a)-(c) that the estimates of $\theta$ become
steadily more accurate as sample size increases. Moreover, the algorithm
is largely insensitive to the bandwidths used in the search; the
estimates of $\theta$ with the three different bandwidths differ only a
little. Furthermore, the algorithm is also insensitive to the value of
$B$.
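The inner product is a natural accuracy measure because, for unit vectors, the squared Euclidean distance equals $2(1 - \text{inner product})$. The sketch below illustrates this, assuming directions are normalized to unit length with first nonzero component positive; the function names and the perturbed estimate are hypothetical.

```python
import math

def to_theta(v):
    """Scale v to unit length and flip sign so the first nonzero entry is positive."""
    norm = math.sqrt(sum(c * c for c in v))
    u = [c / norm for c in v]
    first = next(c for c in u if c != 0.0)
    return [-c for c in u] if first < 0 else u

def inner(a, b):
    """Euclidean inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

theta = to_theta([1, 2, 0, 3])
estimate = to_theta([-1.1, -1.9, 0.1, -3.0])  # made-up estimate, sign flipped on purpose
ip = inner(theta, estimate)
sq_dist = sum((a - b) ** 2 for a, b in zip(theta, estimate))
print(ip, sq_dist, 2 * (1 - ip))
```

An inner product close to 1 therefore corresponds directly to a small estimation error, which is why boxplots of the inner product summarize accuracy compactly.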

Figure 1(d) and (e) displays boxplots of the bandwidths $\hat h$,
obtained by minimizing $M_1$ in (2.10), and $\hat H$, defined by the
method of Hall, Wolff and Yao (1999). As expected, empirical bandwidth is
a decreasing function of sample size. Note too that selected
$\hat h$'s are in general noticeably smaller than the chosen $\hat H$,
which is in agreement with the asymptotic orders of $h$ and $H$.

We also calculated values of the local linear estimator defined in (2.7)
with bandwidth $\hat H$. Figure 1(f) gives average absolute errors,
computed using a regular grid (with adjacent points distant 0.05 apart)
in the $(\theta^T X, Y)$-plane. For the sake of comparison we also report
the errors for the estimators based on the true $\theta$. Clearly,
accuracy increases with sample size, and estimators based on
$\hat\theta$ are less accurate than those based on the true $\theta$.
However, the deficit due to errors in estimating $\theta$ is not great
when $n = 200$, and is negligible when $n = 400$. Choice of radius $r$ is
not critical either; results with $r = 0.5$ and $1.5$ are similar to
those for $r = 1$, and therefore are not reported here.

Example 2. Next we consider the model 

Yi = |(sinXji + sin X i2 + sin X i3 + sin X iA ) + e,;. 

Now the conditional distribution of $Y$ given
$X = (X_1, \ldots, X_4)^T$ no longer depends on $X$ only through a linear
combination of its components. The true value of $\theta$ is
$(0.5, 0.5, 0.5, 0.5)^T$; note the symmetry of the model. We selected the
spheres in the same way as in Example 1. The numerical results are
presented in Figure 2, which displays a similar pattern to Figure 1
although the estimates in general are not as accurate as in Example 1.
This is due to the fact that we were estimating the least-squares
approximation, in the sense of minimizing (4.1), of the conditional
distribution of $Y$ given $X$, rather than the conditional distribution
itself. Figure 2(a)-(c) shows that the estimation for $\theta$ is still
accurate, even for the sample size $n = 200$, and is steadily improved
when $n$ is increased to 400.
Fig. 1. Simulation results for Example 1. Boxplots of the inner product
$\hat\theta^T \theta$, with bandwidth $h$ taken equal to (a) $\hat h$,
(b) $1.5\hat h$, and (c) $0.7\hat h$; and of (d) $\hat h$, (e) $\hat H$,
and (f) average absolute errors of estimated conditional distribution of
$Y$ given $\theta^T X$ with either $\theta = \hat\theta$ (denoted by "E")
or $\theta$ equal to its true value (denoted by "T"). In panels (a)-(c)
the boxplots are grouped by sample size ($n = 200$, $400$) and number of
spheres ($B = 625$, $2401$).



Fig. 2. Simulation results for Example 2. Panels show the same
information as in Figure 1.

Example 3. Finally we illustrate our method with
$\{Y_t, 1 \le t \le 176\}$, the quarterly growth rates of US real GNP
between February 1947 and January 1991. The data series is plotted in
Figure 3. This dataset has been analyzed by, for example, Tiao and Tsay
(1994). Let $X_t = (Y_{t-1}, Y_{t-2})^T$. We estimated the value of
$\theta = (\theta_1, \theta_2)^T$ for which the conditional distribution
of $Y_t$, given $\theta^T X_t$, was the best approximation for the
conditional distribution of $Y_t$ given $X_t$, in the sense that
$S(\theta)$, defined at (4.1), was minimized. We first standardized the
data $X_t$. Sphere centers were taken to be the points $X_t$ (so that
$B = n$), with radius $r = 1$. The resulting estimate is
$\hat\theta = (0.580, -0.815)^T$.

Fig. 3. Prediction intervals for US quarterly GNP growth $X_t$ based on,
respectively, three different predictors
$0.580X_{t-1} - 0.815X_{t-2}$, $X_{t-1}$ and $(X_{t-1}, X_{t-2})$.

Once $\hat\theta$ was obtained we constructed the adjusted
Nadaraya-Watson estimator $\hat F(\cdot \mid z)$ [see Hall, Wolff and Yao
(1999)] of the conditional distribution of $Y_t$, given
$\hat\theta^T X_t = z$. The resulting quantile prediction interval is
$[\hat F^{-1}(\tfrac12\alpha \mid z),\ \hat F^{-1}(1 - \tfrac12\alpha \mid z)]$,
for $\alpha \in (0, 1)$. To check on performance we used the first 166
data points to estimate $\theta$ and $\hat F(\cdot \mid z)$, and employed
the last ten data points to validate the predicted values. Results with
$\alpha = 0.1$ are reported in Table 1. Note that with
$\hat\theta = (0.580, -0.815)^T$, the predictor is
$0.580Y_{t-1} - 0.815Y_{t-2}$. For comparison we also report prediction
intervals using a single predictor $Y_{t-1}$, and a two-dimensional
predictor $(Y_{t-1}, Y_{t-2})$.
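Turning an estimated conditional distribution function into a quantile prediction interval amounts to inverting it at two probability levels. The sketch below shows one way to do this, assuming the estimate is a nondecreasing function of $y$ and using the generalized inverse (the smallest candidate value at which the function reaches the level); the function names are hypothetical, and an ordinary empirical distribution function stands in for the kernel estimator.

```python
def quantile_from_cdf(cdf, candidates, p):
    """Smallest candidate y with cdf(y) >= p (generalized inverse on a finite set)."""
    for y in sorted(candidates):
        if cdf(y) >= p:
            return y
    return max(candidates)

def prediction_interval(cdf, candidates, alpha):
    """Interval between the alpha/2 and 1 - alpha/2 quantiles of cdf."""
    lower = quantile_from_cdf(cdf, candidates, alpha / 2)
    upper = quantile_from_cdf(cdf, candidates, 1 - alpha / 2)
    return lower, upper

# Stand-in distribution function (hypothetical): empirical CDF of 1, ..., 25.
values = list(range(1, 26))

def empirical_cdf(y):
    return sum(1 for v in values if v <= y) / len(values)

interval = prediction_interval(empirical_cdf, values, alpha=0.1)
print(interval)
```

In an application the candidate set would typically be the observed responses, and the distribution function would be the conditional estimate evaluated at the predictor value of interest.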

All the intervals in the table contain the corresponding true values.
Prediction intervals based on two predictors $Y_{t-1}$ and $Y_{t-2}$ are
more accurate, in general, than those based on a single predictor
$Y_{t-1}$, since the average length of the prediction intervals is
reduced from 3.51 to 3.21. It is interesting to see that the average
length of the prediction intervals based on the selected single predictor
$0.580Y_{t-1} - 0.815Y_{t-2}$ is 3.22, which is almost the same as that
based on $(Y_{t-1}, Y_{t-2})$. Note too that our method does not use
multivariate smoothing techniques, which are susceptible to the "curse of
dimensionality." Predictions based on $d = 3$ and $4$ did not lead to
significant improvements, and therefore are omitted. The absence of
improvement is in agreement with results of Tiao and Tsay (1994), who
proposed nonlinear, second-order autoregressive models for this dataset.
5. Outlines of technical arguments.

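Outline proof of Theorem 3.1. Our argument has two main stages,
showing, respectively, that

(5.1) $\|\hat\theta - \theta_0\| = O_p(n^{\varepsilon - (1/2)})$ for each $\varepsilon > 0$,

(5.2) $S(\theta) = T + (\theta - \theta_0)^T M_0 (\theta - \theta_0) - 2(\theta - \theta_0)^T (V_1 + V_2) + o_p(\|\theta - \theta_0\|^2) + O_p\big(\sum \|\theta - \theta_0\|^u n^{-v-\zeta}\big),$

where $T$ does not depend on $\theta$, $\zeta > 0$ is fixed,

$V_1 = \iint dF(y)\, \dot G_{\theta_0}(A_\alpha, y)\, \xi_n(A_\alpha, y, \theta_0)\, d\alpha,$

$V_2 = \iint dF(y)\, D_{\theta_0}(\alpha, y)\, \dot G_{\theta_0}(A_\alpha, y)\, d\alpha,$

$\xi_n(A, y, \theta) = \frac{1}{n} \sum_{i=1}^{n} \big[ \psi(A, X_i, Y_i, y, \theta) - E\{\psi(A, X, Y, y, \theta)\} \big],$

$\psi$ is as in (3.5), and
$O_p(\sum \|\theta - \theta_0\|^u n^{-v-\zeta})$ denotes a quantity which
uniformly in $\theta$ is of order no more than that of the sum of
$\|\theta - \theta_0\|^u n^{-v-\zeta}$ over a fixed, finite set of pairs
$(u, v)$, where in each case, $u, v \ge 0$ and $\tfrac12 u + v \ge 1$.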

To give an appreciation of the origin of the terms which make up the
$O_p(\cdots)$ remainder in (5.2), we note that the contributions to the
remainder come from different steps in a Taylor expansion of
$S(\theta)$. In particular, terms of the following orders arise in that
way:

\\9-9 Q \\h 2 , \\9-9 Q \\(nh 3/2 )- 1 n £ , \\9 - ^oll^~ t_(1/2) , 

(5.3) 



9 \\ 2 (nh 3 )-\ (nh 3 ) e - 2 , n"'- 1 , 



Table 1

True value       0.580X_{t-1} - 0.815X_{t-2}   X_{t-1}          (X_{t-1}, X_{t-2})
 0.67            [-0.99, 2.32]                 [-0.99, 2.32]    [-0.62, 3.11]
 0.89            [-0.91, 2.32]                 [-0.88, 2.34]    [-0.59, 2.28]
 0.40            [-0.99, 2.20]                 [-1.56, 2.54]    [-0.86, 2.34]
 0.43            [-0.91, 2.34]                 [-0.99, 2.32]    [-0.62, 3.11]
 0.09            [-0.91, 2.28]                 [-0.88, 2.34]    [-0.59, 2.21]
 0.42            [-0.99, 2.20]                 [-1.56, 2.54]    [-1.17, 2.34]
 0.11            [-0.88, 2.32]                 [-0.99, 2.32]    [-0.62, 2.32]
 0.36            [-0.91, 2.34]                 [-0.88, 2.34]    [-0.59, 2.12]
-0.40            [-0.99, 2.34]                 [-1.56, 2.54]    [-0.86, 2.54]
-0.65            [-0.81, 2.32]                 [-0.91, 2.32]    [-0.91, 2.32]
Average length   3.22                          3.51             3.21
where in each case the bound is valid for all $\varepsilon > 0$ and some
$t > 0$. Noting that, by (3.4), $n^{C_1 - (1/3)} \le h \le n^{-C_2 - (1/4)}$
for constants $C_1, C_2 > 0$, and using the upper of these bounds when
$h$ appears with a positive exponent in (5.3), and the lower when $h$
appears with a negative exponent, we see that each of the quantities in
(5.3) may be written as $\|\theta - \theta_0\|^u n^{-v-\zeta}$ for some
$\zeta > 0$ and some $(u, v)$ such that $\tfrac12 u + v \ge 1$.

More detailed proofs of (5.1) and (5.2) can be found in Hall and Yao
(2002). To illustrate the use of the regularity conditions (3.1)-(3.4),
we mention that (3.1) is employed to guarantee adequate smoothness of $F$
when Taylor-expanding $F(Y_j \mid \theta^T x)$ and related functions;
that (3.1) and (3.2) together ensure that the effective design density is
bounded away from zero, which allows us to deal with the denominator of
$\hat F_{-i,-j}(Y_j \mid \theta^T X_i)$ via a stochastic Taylor
expansion; that (3.3) guarantees that the minimum of $S_0(\theta)$ is
attained in the usual quadratic way, or equivalently that the matrix
$M(\theta_0)$ is of full rank in the $(d-1)$-dimensional space of vectors
perpendicular to $\theta_0$; and that one of the applications of (3.4)
was described in the previous paragraph.

Taking $\theta = \hat\theta$ in (5.2), and noting (5.1), we see that the
remainder term in (5.2) may be written as
$O_p(\sum n^{\varepsilon u - (u/2) - v - \zeta})$. Since
$\tfrac12 u + v \ge 1$ for each pair $(u, v)$ contributing to the series,
and since $\zeta > 0$ (with $\varepsilon$ chosen sufficiently small),
this $O_p(\cdots)$ remainder equals $o_p(n^{-1})$. Theorem 3.1 follows
from this form of (5.2), and from the fact that $n^{1/2}(V_1 + V_2)$
converges in distribution to $V$, the latter defined a little before the
statement of the theorem. □

Outline proof of Theorem 3.2. Let $\Theta_n$ denote the set of all
$\theta \in \Theta$ that satisfy
$\|\theta - \theta_0\| \le \delta(n)\, n^{-2/5}$, where
$\delta(n) \downarrow 0$ as $n \to \infty$. The theorem follows from the
following result.

Lemma. Assume (3.2), (3.7), (3.8) and that $x \in \mathcal{R}$. Then for
each $y$,

$\sup_{\theta \in \Theta_n} |\tilde F_\theta(y \mid \theta^T x) - \tilde F_{\theta_0}(y \mid \theta_0^T x)| = o_p(n^{-2/5}).$

We outline the proof of the lemma. Treat $\tilde F_\theta$ as the ratio
expressed in (2.7), although multiply top and bottom there by
$(nh)^{-1}$ [here $(nH)^{-1}$, since we take the bandwidth to be $H$] in
order to ensure that neither the numerator nor the denominator converges
to zero or diverges to infinity. The numerator and denominator are now
each in the form $T_1 T_2 - T_3 T_4$, where each $T_j$ is linear in
functions of the data $X_i$, and has a proper limit as $n$ diverges.
Additively decompose each $T_j$ into its expected value (or mean), and
the difference between it and its mean. Each mean is of course purely
deterministic. In the remainder of this section we shall outline the
technique, starting from this decomposition, for treating $T_1$ and
$T_2$; a similar argument may be given in the case of $T_3$ or $T_4$.
The expected value of $T_1$ or $T_2$ may be written as its
"$H \to 0$ limit," plus a term that equals $H^2$ multiplied by a function
of $\theta$, plus a remainder that equals $o(H^2)$ uniformly in
$\theta$. The "$H \to 0$ limit," evaluated at $\theta$, equals the same
quantity evaluated at $\theta_0$ rather than at $\theta$, plus a
remainder of order $O\{\delta(n)\, n^{-2/5}\} = o(n^{-2/5})$, uniformly in
$\theta \in \Theta_n$; and similarly, the coefficients of $H^2$ (for
$\theta$ and $\theta_0$, respectively) are identical, up to a term that
converges to 0 uniformly in $\theta \in \Theta_n$ as $n \to \infty$.
These arguments require only Taylor expansion, and prove that the mean of
each of the $T_j$'s equals its counterpart when $\theta$ is replaced by
$\theta_0$, plus terms that are of size $o(n^{-2/5})$ uniformly in
$\theta \in \Theta_n$. A longer argument [see Hall and Yao (2002)] can be
used to show that the same property is enjoyed by each $T_j - E(T_j)$,
not just by each $E(T_j)$. The theorem follows from these properties. □

Acknowledgments. We are grateful to an Editor and two reviewers for 